Brain Vignette

Welcome to Amethyst! The goal of Amethyst is to facilitate comprehensive analysis tools for single-cell methylation data. If you use this package or code on our Github for your work, please cite our manuscript.

PACKAGE INSTALLATION

If you are new to R, you may need to install some of the dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

library("BiocManager")
BiocManager::install(c("caret", "devtools", "data.table", "dplyr", "furrr", "future", "future.apply",
  "ggplot2", "grDevices", "gridExtra", "igraph", "irlba", "janitor", "Matrix", "methods", "pheatmap",
  "plotly", "plyr", "purrr", "randomForest", "rhdf5", "rtracklayer", "Rtsne", "scales", "stats", "stringr", 
  "tibble", "tidyr", "umap", "utils"))

Next, install a few additional dependencies found on Github, including amethyst itself.

devtools::install_github("JinmiaoChenLab/Rphenograph")
devtools::install_github("KrishnaswamyLab/MAGIC/Rmagic")
devtools::install_github("TomKellyGenetics/leiden")
devtools::install_github("lrylaarsdam/amethyst")

Now load libraries into R:

library(amethyst)
library(data.table)
library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(plyr)
library(future)
library(furrr)
library(purrr)
library(cowplot)
library(leiden)
library(pheatmap)

LOADING PRACTICE DATA

This vignette comes with a toy dataset of 50 high-coverage cells derived from the human brain cortex. It is similar to the PBMC vignette, but contains some added features as the brain uniquely has very high levels of CH methylation in addition to CG. These two modalities operate by different principles and therefore necessitate tailored analysis approaches. However, processing CH methylation can take a long time, as there about half a billion sites in the human genome. To make this illustration quickly executable on any local computer, we have already done initial processing for computationally-intensive steps. These can be downloaded with a pre-made workspace in the next section. To do all steps from scratch, we recommend going through the CG-focused PBMC vignette.

Note: By default, data will download to the “~/Downloads/” folder. Change if a different directory is desired.

# h5 file containing base-level CG methylation information. This file is 12 GB and not necessarily needed for this tutorial, but available if desired.
# download.file("https://adeylabopen.s3.us-west-2.amazonaws.com/amethyst/brain_vignette.h5", "./brain_vignette.h5", method = "curl") 
# workspace containing initial CH processing steps
download.file("https://adeylabopen.s3.us-west-2.amazonaws.com/amethyst/brain_vignette_workspace.RData", "~/Downloads/brain_vignette_workspace.RData", method = "curl") 
# metadata files
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/brain_vignette/brain_vignette_cellInfo.txt", "~/Downloads/brain_vignette_cellInfo.txt")
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/brain_vignette/brain_vignette.annot", "~/Downloads/brain_vignette.annot", method = "curl")

ASSEMBLING THE AMETHYST OBJECT

First, we must construct an amethyst object, which stores the path to the h5 file as well as a lot of other information that will be calculated downstream. This has already been initialized with the createObject command in the process of pre-calculating computationally intenstive steps. Load the workspace to see a summary of the object so far. Please see the PBMC vignette for a detailed description of how to construct the object.

# obj <- createObject()
load("~/Downloads/brain_vignette_workspace.RData")
summary(obj)
##   Length    Class     Mode 
##        1 amethyst       S4

Next, we need to add metadata about each cell. Useful metadata includes quality control metrics contained in the .cellInfo.txt intermediate output or .annot files, if using the Premethyst workflow. We also need to specify the location of the h5 file containing site-level methylation data for each barcode. In this case, every barcode belongs to the same h5 file, but an unlimited number of h5 files can be used in the same object.

Note: If you are combining experiments with overlapping barcodes, please see our vignette for addressing this. The h5paths slot is set up differently. Also, please note we have changed the heading from ‘paths’ (v0.0.0.9000) to ‘path’ (v1.0.0) to be more consistent, given the addition of the barcode and prefix columns.

obj <- addCellInfo(obj, file = "~/Downloads/brain_vignette_cellInfo.txt")
obj <- addAnnot(obj, file = "~/Downloads/brain_vignette.annot", name = "method") 
obj@h5paths <- data.frame(barcode = rownames(obj@metadata), path = rep("~/Downloads/brain_vignette.h5", length(rownames(obj@metadata))))

Filtering

While not essential, it can be helpful to filter cells right away so downstream functions don’t perform calculations for cells that will not be used. This can easily be done with dplyr logic.

ggplot(obj@metadata, aes(x = cov)) + geom_histogram(bins = 10) + theme_classic()

obj@metadata <- obj@metadata |> dplyr::filter(cov > 1000000 & mch_pct < 12)

CLUSTERING

The next step is to cluster cells, which we typically do based on methylation values over fixed genomic windows. Depending on the size and depth of your dataset, calculating methylation levels over features can be a computationally intensive step. There are three ways to do this (only do one):

  1. Calculate genomic windows in R with makeWindows
  2. Use Facet to pre-compute methylation levels over genomic regions and store in the h5 file
  3. Use a different method - like MethSCAn - and add results to the Amethyst object

Note: Clustering methods for single-cell methylation data, as a newer modality, will require more hands-on interpretation of results. Please thoroughly check that your results are not driven by technical issues (like coverage bias) and expect that this step will take some optimization.

1. Calculate methylation levels over genomic windows in R using Amethyst functions

The most straightforward method is to calculate methylation levels over features with Amethyst in R. First, perform an initial indexing step, which determines the locations corresponding to each chromosome in every h5 file. This reduces the computational load by only calculating across one at a time.

Note: If you are using non-conventional chromosomes, please provide a whitelist (see function parameters).
Note: You may have to copy/paste any code reading the h5 file (e.g., makeWindows) directly into the console instead of running the chunk, if using the .Rmd template.

#  don't run; the output of these steps is already in your workspace
obj@index[["chr_cg"]] <- indexChr(obj, type = "CG", threads = 1) 
obj@genomeMatrices[["cg_100k_score"]] <- makeWindows(obj, stepsize = 100000, type = "CG", metric = "score", threads = 1, index = "chr_cg", nmin = 2) 
obj@index[["chr_ch"]] <- indexChr(obj, type = "CH", threads = 1)
obj@genomeMatrices[["ch_100k_pct"]] <- makeWindows(obj, stepsize = 100000, type = "CH", metric = "percent", threads = 1, index = "chr_ch", nmin = 5)

2. Calculate methylation levels over genomic windows using Facet

makeWindows can be memory-intensive. If you have very large datasets, it may be preferable to apply Facet, a Python-based helper script to compute windows in a more efficient manner. Results are stored in the h5 file and loaded as needed. For installation instructions and examples, see Facet documentation. The test dataset already has 100kb window values calculated using the command below.

Note: Skip to the “clustering continued” section if you used makeWindows to calculate 100kb window methylation score.

# example of how to compute aggregated windows with facet
facet agg -u 100000 brain_vignette_facet.h5

If pre-computed windows are stored in the h5 file, load to the genomeMatrices slot with loadWindows. An indexing step is not needed for Facet.

obj@genomeMatrices[["cg_100k_score_facet"]] <- loadWindows(obj, name = "100000", nmin = 2, metric = "score")

3. Integrate an alternative method

There are many methods out there for clustering. One may wish to implement an alternative strategy, like clustering with MethSCAn variably methylated regions. This worked quite well for our small, high-coverage test dataset. Results from other methods can easily be dropped in to an Amethyst object and analysis can proceed as normal. Please see our vignette for example alternative clustering methods. In the end, all you have to do is assign the output to a new reductions result:

# theoretical example of 2D umap projection from methscan vmrs - do not run
obj@reductions[["umap_methscan_vmrs"]] <- methscan_result
colnames(obj@reductions[["umap_methscan_vmrs"]]) <- c("dim_x", "dim_y")

Clustering with Amethyst or Facet - continued

You may want to remove windows where most values are NA. In this case, since the vignette data is high coverage and the genomic windows are large, I am going to filter for at least 90% of the cells have values in that window. The appropriate threshold will highly depend on how big the windows are and how many cells you have. We will also remove chrY windows because they tend to be sparse and can induce clustering artifacts.

Note: Please adjust this threshold in a dataset-specific manner. If you use short windows or have lower-coverage data, a 90% threshold is way too high.

# remove windows with < 90% observed values
obj@genomeMatrices[["cg_100k_score"]] <- obj@genomeMatrices[["cg_100k_score"]][rowSums(!is.na(obj@genomeMatrices[["cg_100k_score"]])) >= nrow(obj@metadata)*.9, ]
obj@genomeMatrices[["ch_100k_pct"]] <- obj@genomeMatrices[["ch_100k_pct"]][rowSums(!is.na(obj@genomeMatrices[["ch_100k_pct"]])) >= nrow(obj@metadata)*.9, ]

# remove chrY windows
obj@genomeMatrices[["cg_100k_score"]] <- obj@genomeMatrices[["cg_100k_score"]][!sapply(rownames(obj@genomeMatrices[["cg_100k_score"]]), function(name) grepl("chrY", name)), ]
obj@genomeMatrices[["ch_100k_pct"]] <- obj@genomeMatrices[["ch_100k_pct"]][!sapply(rownames(obj@genomeMatrices[["ch_100k_pct"]]), function(name) grepl("chrY", name)), ]

Next, perform dimensionality reduction. If you are unsure how many dimensions to use, the dimEstimate function can estimate the number needed to explain the desired variance threshold. In this example, we will reduce the data from “cg_100k_score” and “ch_100k_pct” into five dimensions using the irlba package, which performs fast truncated singular value decomposition. Feel free to implement alternative methods if desired.

set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["irlba_ch100k_cg100k"]] <- runIrlba(obj, genomeMatrices = c("ch_100k_pct", "cg_100k_score"), dims = c(5, 5), replaceNA = c("mch_pct", 0))

Now determine cluster membership with runCluster. Either a Louvain-based or Leiden-based strategy can be used. Assign the output column name, which will be added to the metadata. This update was implemented so one can save multiple iterations within the same object.

Note: In this example, k is low because brain_vignette.h5 has 50 cells. Increase to 50 or so for larger datasets, depending on number of cells and expected heterogeneity.

set.seed(111)
# run clustering with louvain-based method
obj <- runCluster(obj, k = 10, reduction = "irlba_ch100k_cg100k", method = "louvain", colname = "louvain_cluster") 
obj <- runCluster(obj, k = 10, reduction = "irlba_ch100k_cg100k", method = "leiden", colname = "leiden_cluster") 

You can use the head function to look at how results are stored:

head(obj@metadata)
##                                    cov  cg_cov mcg_pct   ch_cov mch_pct method
## ATTGAGGATAACGCTTCGTCCGCCGATC 106462778 4372087   77.09 86598316    3.09     LA
## ATTGAGGATAACGCTTCGTCTAAGGTCA 116628882 4956483   75.33 94834293    5.30     LA
## ATTGAGGATACATTACCAAGACCTTGGC 102852337 4536119   75.48 83270322    5.99     LA
## ATTGAGGATACCAATTCCATCACGAGCG 104380792 4490347   70.24 84576070    1.21     LA
## ATTGAGGATACCAATTCCATTTGCCTAG 105518604 4616346   75.88 85112793    5.80     LA
## ATTGAGGATACTACTGCAAGATGGCATG 110343538 4985043   72.10 88886888    1.75     LA
##                              louvain_cluster leiden_cluster
## ATTGAGGATAACGCTTCGTCCGCCGATC               4              4
## ATTGAGGATAACGCTTCGTCTAAGGTCA               1              2
## ATTGAGGATACATTACCAAGACCTTGGC               1              2
## ATTGAGGATACCAATTCCATCACGAGCG               2              5
## ATTGAGGATACCAATTCCATTTGCCTAG               1              2
## ATTGAGGATACTACTGCAAGATGGCATG               3              1

Next, project to 2D with runUmap and/or runTsne. Assign the output to a reductions slot. Like the runCluster update, this allows multiple projections to be stored within the same object.

Note: Neighbors and perplexity values are small because this dataset is 50 cells. Increase for larger datasets.

set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["umap_irlba_ch100k_cg100k"]] <- runUmap(obj, neighbors = 5, dist = 0.05, method = "euclidean", reduction = "irlba_ch100k_cg100k") 
obj@reductions[["tsne_irlba_ch100k_cg100k"]] <- runTsne(obj, perplexity = 5, method = "euclidean", theta = 0.2, reduction = "irlba_ch100k_cg100k") 

Visualizing the results

First, plot the UMAP or TSNE coordinates of the cells with the color corresponding to cluster membership.

p1 <- dimFeature(obj, colorBy = louvain_cluster, reduction = "umap_irlba_ch100k_cg100k", pointSize = .8) + ggtitle("UMAP CH 100k pct + CG 100k score; Louvain")
p2 <- dimFeature(obj, colorBy = leiden_cluster, reduction = "tsne_irlba_ch100k_cg100k", pointSize = .8) + ggtitle("TSNE CH 100k pct + CG 100k score; Leiden")
plot_grid(p1, p2)

dimFeature uses ggplot logic, so you can easily modify plots as needed. Below is an example of cells divided by a simulated “batch” parameter.

# Adding a random variable "batch" to illustrate faceting
set.seed(111)
obj@metadata$batch <- sample(1:3, nrow(obj@metadata), replace = TRUE)
dimFeature(obj, colorBy = louvain_cluster, reduction = "umap_irlba_ch100k_cg100k", pointSize = .8) + facet_wrap(vars(batch), nrow = 1) 

Show the distribution of cluster membership between samples with sampleComp. Plots can be easily modified with ggplot command logic.

Note: Since there are so few cells, the variation in proportion between batches is not concerning.

sampleComp(obj, groupBy = "batch", colorBy = "louvain_cluster") 

If you want to make the UMAP/TSNE plots look nicer, Amethyst provides many built-in color palettes:

testPalette(output = "swatch", n = length(unique(obj@metadata$louvain_cluster)))

Assign your favorite to a name as a shortcut for future plotting calls.

library(ggrepel)
group_labels <- merge(obj@metadata, obj@reductions[["umap_irlba_ch100k_cg100k"]], by = 0) |> dplyr::group_by(louvain_cluster) |> dplyr::summarise(dim_x = mean(dim_x), dim_y = mean(dim_y))

pal <- makePalette(option = 13, n = 5)
dimFeature(obj, colorBy = louvain_cluster, colors = pal, pointSize = .8, reduction = "umap_irlba_ch100k_cg100k") +
  geom_text_repel(data = group_labels, aes(x = dim_x, y = dim_y, label = louvain_cluster))

In addition to cluster distribution, dimFeature is useful for visualizing how different parameters in the cellInfo file are distributed throughout the UMAP. From global %mCH alone we can tell right away that clusters 1 and 5 are neurons. In the groups that have lower %mCH levels, we can hypothesize that 2 are microglia, 3 are oligodendrocytes, and 4 are astrocytes.

p1 <- dimFeature(obj, reduction = "umap_irlba_ch100k_cg100k", colorBy = log(cov), pointSize = .8) + 
  scale_color_gradientn(colors = c("black", "turquoise", "gold", "red")) + ggtitle("Coverage distribution") 
p2 <- dimFeature(obj, reduction = "umap_irlba_ch100k_cg100k", colorBy = log(mch_pct), pointSize = .8) + 
  scale_color_gradientn(colors = c("black", "turquoise", "gold", "red")) + ggtitle("Global %mCH distribution") 
plot_grid(p1, p2)

ANNOTATION

Now that we have clusters, the next step is determining their biological identity. This can be quite challenging for a novel modality like methylation. In addition to leveraging global %mCH, there are a couple ways we recommend going about this:

  1. mCG hypomethylation over regulatory elements of canonical marker genes
  2. mCH hypo/hypermethylation over canonical marker genes (brain-specific)
  3. Compare to a reference if possible

1. Use mCG patterns over canonical marker genes

If you have an in-depth knowledge of the system, one useful method is to look at mCG hypomethylation over canonical marker genes. The first step is to load an annotation file for the reference genome so Amethyst knows the coordinates for each gene.

obj@ref <- makeRef("hg38")

Next, calculate methylation levels in short genomic windows for each cluster. We recommend 500bp windows for real data, but 1000 are used here since the vignette dataset is only 50 cells.

Note: To avoid the hassle of recomputing these every time clusters are adjusted, one can also calculate sliding smoothed windows with Facet for each cell and store in the h5 file, then re-calculate/load more rapidly per group with loadSmoothedWindows. See ?loadSmoothedWindows for parameter specifications.

cluster1kbwindows_cg <- calcSmoothedWindows(obj, 
                                         type = "CG", 
                                         threads = 1,
                                         step = 1000, # change to 500 for real data
                                         smooth = 3,
                                         genome = "hg38",
                                         index = "chr_cg",
                                         groupBy = "louvain_cluster",
                                         returnSumMatrix = TRUE, # save sum matrix for DMR analysis
                                         returnPctMatrix = TRUE)

Add the percent matrix output to the genomeMatrices slot. This will be used for visualizing methylation patterns over genomic regions with histograM or heatMap functions.

obj@tracks[["cg_cluster_tracks"]] <- cluster1kbwindows_cg[["pct_matrix"]]

Now you can view methylation patterns over key marker genes:

Note: Blue is hypomethylated. Legend has been ommitted for plot clarity. See legend with “legend = T”.

heatMap(obj, 
        genes = c("SLC17A7", # excitatory glutamatergic neurons
                  "DLX1", # inhibitory gabaergic neurons
                  "GAD1",  # inhibitory gabaergic neurons
                  "C1QA", # microglia
                  "MOG", # oligodendrocytes
                  "SLC1A3" # astrocytes
                  ), 
        track = "cg_cluster_tracks", 
        nrow = 3,
        legend = F)

The mCG patterns over these marker genes supports our hypotheses of cluster identity based on global %mCH.

If you want to simultaneously view another other information - like regulatory elements - these can be plotted as “tracks” below. In this example, we will use Encode candidate cis-regulatory elements (cCREs). The track structure should be a named list of data tables with chr, start, and end columns. First, download and split tracks into the expected structure.

# note this is for hg38 build only
library(rtracklayer)

download.file("https://hgdownload.soe.ucsc.edu/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb", "~/Downloads/encodeCcreCombined.bb") 
ccre <- as.data.table(rtracklayer::import("~/Downloads/encodeCcreCombined.bb"))
setnames(ccre, "seqnames", "chr")
ccre_tracks <- split(ccre, ccre$ucscLabel)
ccre_tracks <- lapply(ccre_tracks, function(x) {x[,.(chr, start, end)]})

Now plot tracks under the heatMaps:

heatMap(obj, 
        genes = "SLC1A3", 
        track = "cg_cluster_tracks", 
        legend = T,
        extraTracks = ccre_tracks)

As you can see from the heatMaps, promoters are sometimes universally hypomethylated or not at the predicted site. Methylation patterns over gene bodies are highly complex. Because of this, we prefer these tools to visualize over the whole gene body and surrounding context as opposed to aggregated metrics.

In addition to heatMap, you can also view methylation levels with the histograM function. Please see function documentation for a full list of which parameters can be changed.

histograM(obj, 
          baseline = "mean", # can plot from either mean or 0; see documentation
          orientation = "rows",
          genes = "SATB2", 
          track = "cg_cluster_tracks",
          legend = T)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

2. mCH global levels and hypo/hypermethylation over canonical marker genes

The brain uniquely has extremely high levels of mCH in addition to mCG, which can provide further information for cell type annotation. For example, glia have very little mCH, allowing for rapid distinction of neuronal vs glial groups. In addition, mCH has been shown to accumulate most variably across canonical marker genes distinguishing closely related subtypes. Gene body mCH accumulation is anticorrelated with gene expression.

To highlight differences between CG and CH, we will first look at broad levels over the whole genome at once with heatMapGenome. This function was constructed specifically for CH because levels uniquely fluctuate across megabase-scale regions thought to be guided by topologically-associated domain boundaries.

As a reference, we will first use heatMapGenome to look at 100kb levels in the CG context.

# First, aggregate CG levels by cluster
obj@genomeMatrices[["cg_100k_score_cluster"]] <- aggregateMatrix(obj = obj, matrix = "cg_100k_score", groupBy = "louvain_cluster", nmin = 2)

# Then plot
heatMapGenome(obj, matrix = "cg_100k_score_cluster", colors = c("#005eff", "#9daec9", "#cccccc", "#dbdbdb"))

Notice how there is subtle variation, but it relatively stable, particularly across groups. Next, we will plot the CH context.

# First, aggregate CH levels by cluster
obj@genomeMatrices[["ch_100k_pct_cluster"]] <- aggregateMatrix(obj = obj, matrix = "ch_100k_pct", groupBy = "louvain_cluster", nmin = 2)

# Then plot
heatMapGenome(obj, matrix = "ch_100k_pct_cluster", colors = c("black", "red", "yellow"))

Notice how: 1) groups 2-4 have very low global %mCH. These are likely glia (which we already determined). 2) There are megabase-scale fluctuations in CH levels, some of which are highly disparate between groups. It’s more apparent in a larger dataset. Let’s zoom in with heatMap to see an example. First, we must make cluster tracks for mCH, which was already done for you.

# don't run - already run and in workspace
cluster1kbwindows_ch <- calcSmoothedWindows(obj, 
                                         type = "CH", 
                                         threads = 0,
                                         step = 1000, # change to 500 for real data unless you have really low coverage
                                         smooth = 3,
                                         genome = "hg38",
                                         index = "chr_ch",
                                         groupBy = "louvain_cluster",
                                         returnSumMatrix = TRUE, # save sum matrix for DMR analysis
                                         returnPctMatrix = TRUE)

Add the result to the tracks slot

obj@tracks[["ch_cluster_tracks"]] <- cluster1kbwindows_ch[["pct_matrix"]]

Now use heatMap to zoom in on chr2:170500000-172800000. Notice the disparate mCH levels between groups 1 and 5 over DLX genes, GAD1, and RAPGEF4-AS1.

heatMap(obj, 
        regions = "chr2_170500000_172800000", 
        trackOverhang = 0,
        remove = "^ENS|^MT", # optional; uses grep to remove genes that may be outside the question of interest
        track = "ch_cluster_tracks")

Hopefully this helps illustrate how mCH fluctuates both over large-scale domains and gene boundaries. We will next calculate %mCH over protein coding genes. Again, this has been done for you.

# don't run - already run and in workspace
protein_coding <- unique(obj@ref |> dplyr::filter(type == "gene" & gene_type == "protein_coding" & seqid != "chrM") |> dplyr::pull(gene_name))
obj@genomeMatrices[["gene_ch"]] <- makeWindows(obj, genes = protein_coding, type = "CH", metric = "percent", threads = 1, index = "chr_ch", nmin = 2) 

We can then plot the mCH levels over gene bodies with functions like dotM. Because dotM accepts two matrices for aesthetics, we will first calculate a normalized %mCH level based on average gene values.

obj@genomeMatrices[["gene_ch_norm"]] <- sweep(obj@genomeMatrices[["gene_ch"]], 2, colMeans(obj@genomeMatrices[["gene_ch"]], na.rm = T), FUN="/") 
dotM(obj, genes = c("SATB2", "LINGO1", "SV2B", "CTTNBP2", "BROX", "GAD1", "DNER", "SLC6A1"), 
     groupBy = "louvain_cluster", sizeMatrix = "gene_ch", colorMatrix = "gene_ch_norm") + scale_size(range = c(1, 10))

Based on the variable levels of SATB2 (highly expressed in excitatory neurons) and GAD1 (highly expressed in inhibitory neurons), we can clearly tell from this dot plot that cluster 5 is inhibitory and 1 is excitatory neurons. Our manuscript also explores rare hyper-mCH genes in glia, such as CTTNBP2 in astrocytes (cluster 4) and BROX in oligodendrocytes (cluster 3). That leaves cluster 2 as likely microglia since they have hardly any mCH.

Let’s explore a couple other methods to view gene body %mCH.

dimM(obj, genes = c("SATB2", "GAD1"), reduction = "umap_irlba_ch100k_cg100k", matrix = "gene_ch", pointSize = .8, 
     squish = 7, colors = c("#dbdbdb", "#cccccc", "#265A61", "#0C1E20")) 
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

violinM(obj, genes = c("SATB2", "GAD1"), matrix = "gene_ch", groupBy = "louvain_cluster", colors = pal)

It can also be helpful to use a less directed approach when determining differences between groups. The findClusterMarkers function can be used for finding top cluster-specific differential mCH genes.

# don't run - already in workspace
cluster_markers <- findClusterMarkers(obj, 
                                      matrix = "gene_ch",
                                      groupBy = "louvain_cluster",
                                      method = "bonferroni", 
                                      genes = rownames(obj@genomeMatrices[["gene_ch"]]), 
                                      nmin = 5, # increase to at least 10 for larger datasets
                                      threads = 1)
cluster_markers <- cluster_markers |> dplyr::filter(p.adj < 0.05) 

To select top results, I like to use a combined rank metric of lowest adjusted p value and highest absolute logFC. This helps ameliorate artifacts generated by either approach.

top_markers_mCH <- cluster_markers |> 
  dplyr::group_by(cluster_id, direction) |> 
  dplyr::arrange(p.adj, .by_group = TRUE) |> dplyr::mutate(rank_padj = 1:n()) |>
  dplyr::arrange(desc(abs(logFC)), .by_group = TRUE) |> dplyr::mutate(rank_logFC = 1:n()) |>
  rowwise() |> dplyr::mutate(total_rank = sum(rank_padj, rank_logFC)) |> 
  group_by(cluster_id, direction) |> slice_min(n = 5, order_by = total_rank) 

Now plot some top results with dotM.

dotM(obj, genes = top_markers_mCH$gene, sizeMatrix = "gene_ch", colorMatrix = "gene_ch_norm", groupBy = "louvain_cluster")

Exploration of mCH over key marker genes is a powerful method for annotating brain data. It is the most neuronal subtype-specific epigenetic modality known.

3. Comparison to a reference

Another method for cell type annotation is by comparison to an annotated reference. While few exist, there are high-quality references available for brain data. We have aggregated mCH across key genes from PMC9004682. This reference can be downloaded from Github.

download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/brain_vignette/PMC9004682_5972genes_BA10_ref.RData", "~/Downloads/PMC9004682_5972genes_BA10_ref.RData", method = "curl")
ref <- readRDS("~/Downloads/PMC9004682_5972genes_BA10_ref.RData")
obj@genomeMatrices[["gene_ch_cluster"]] <- aggregateMatrix(obj, matrix = "gene_ch", groupBy = "louvain_cluster")

Now see how aggregated mCH profiles from the reference compare to those of our clusters

cor <- cor(merge(ref, 
                 obj@genomeMatrices[["gene_ch_cluster"]], 
                 by = 0) |> tibble::column_to_rownames(var = "Row.names"), use = "pairwise.complete.obs")
cor <- cor[c(1:ncol(ref)), c((ncol(ref) + 1)):ncol(cor)]
pheatmap(cor)

As previously suspected, cluster 1 is a mixed population of excitatory neurons, and cluster 5 inhibitory neurons. The other 3 are glia, which we also determined from global %mCH levels and %mCG over canonical marker genes. Based on all these annotation tools, we can rename our clusters according to broad class using dplyr logic:

obj@metadata[["type"]] <- dplyr::recode(obj@metadata[["louvain_cluster"]],
                                        "1" = "Exc", 
                                        "2" = "Micro",
                                        "3" = "Oligo",
                                        "4" = "Astro",
                                        "5" = "Inh")

group_labels <- merge(obj@metadata, obj@reductions[["umap_irlba_ch100k_cg100k"]], by = 0) |> dplyr::group_by(type) |> dplyr::summarise(dim_x = mean(dim_x), dim_y = mean(dim_y))

dimFeature(obj, colorBy = type, colors = pal, reduction = "umap_irlba_ch100k_cg100k", pointSize = .8) + 
  geom_text_repel(data = group_labels, aes(x = dim_x, y = dim_y, label = type))

You might also want cluster tracks with the group name:

obj@tracks[["cg_type_tracks"]] <- copy(obj@tracks[["cg_cluster_tracks"]])
setnames(obj@tracks[["cg_type_tracks"]], c("chr", "start", "end", "Astro", "Exc", "Micro", "Oligo", "Inh"))

obj@tracks[["ch_type_tracks"]] <- copy(obj@tracks[["ch_cluster_tracks"]])
setnames(obj@tracks[["ch_type_tracks"]], c("chr", "start", "end", "Astro", "Exc", "Micro", "Oligo", "Inh"))

heatMap(obj, 
        genes = c("SLC17A7", # excitatory glutamatergic neurons
                  "DLX1", # inhibitory gabaergic neurons
                  "GAD1",  # inhibitory gabaergic neurons
                  "C1QA", # microglia
                  "MOG", # oligodendrocytes
                  "SLC1A3" # astrocytes
                  ), 
        track = "cg_type_tracks", 
        nrow = 3,
        legend = F)

DIFFERENTIALLY METHYLATED REGION (DMR) ANALYSIS

There are two main formats to set up DMR analysis. The first is to test DMRs for each cluster against all others. Only the sum matrix (which we saved at the calcSmoothedWindows step) is needed. Here we are using the CH matrix, but this can directly be applied to CH as well.

dmrs_cg <- testDMR(cluster1kbwindows_cg[["sum_matrix"]], eachVsAll = TRUE, nminTotal = 5, nminGroup = 5) 
## 
## Finished group 4
## Finished group 1
## Finished group 2
## Finished group 3
## Finished group 5

Then expand and filter the resulting list according to the desired stringency.

dmrs_cg <- filterDMR(dmrs_cg, method = "bonferroni", filter = TRUE, pThreshold = 0.01, logThreshold = 1.5)
head(dmrs_cg)
##     chr   start     end test         pval   logFC         padj direction
## 1: chr1 1351000 1352000    1 1.348357e-12 -1.9606 1.777478e-05      hypo
## 2: chr1 1472000 1473000    1 6.826521e-14 -1.7574 8.999094e-07      hypo
## 3: chr1 2527000 2528000    1 7.827937e-12 -2.1370 1.031922e-04      hypo
## 4: chr1 2528000 2529000    1 6.432860e-14 -1.9806 8.480149e-07      hypo
## 5: chr1 2530000 2531000    1 2.383553e-13 -1.9886 3.142130e-06      hypo
## 6: chr1 2531000 2532000    1 1.673948e-18 -2.7884 2.206689e-11      hypo

Especially since the matrix is smoothed, adjacent genomic windows may be significant. You can collapse them with the following function. If annotation = T, any overlapping genes will be noted in the results table.

Note: Please pay attention to the maxDist and minLength parameters and decide if these are appropriate for your data. Here, we are combining any DMRs within 2kb of each other, and applying a minimum threshold of 2kb to count as a DMR.

collapsed_dmrs_cg <- collapseDMR(obj, dmrs_cg, maxDist = 4000, minLength = 2000, reduce = T, annotate = T) 
head(collapsed_dmrs_cg)
##     chr dmr_start dmr_end test direction dmr_length     dmr_padj dmr_logFC
## 1: chr1    997000 1000000    3      hypo       3000 1.228586e-08 -2.237933
## 2: chr1   1304000 1307000    4      hypo       3000 1.101757e-06 -1.670567
## 3: chr1   1305000 1308000    3     hyper       3000 4.898152e-08  1.869867
## 4: chr1   1305000 1309000    2      hypo       4000 7.382209e-04 -1.873633
## 5: chr1   1431000 1436000    4      hypo       5000 2.754107e-26 -2.174540
## 6: chr1   1630000 1632000    3      hypo       2000 6.139035e-13 -2.245600
##               gene_names
## 1: ENSG00000272512, HES4
## 2:                 ACAP3
## 3:                 ACAP3
## 4:          ACAP3, PUSL1
## 5:       LINC01770, VWA1
## 6:                  MIB2

The “test” column indicates which cluster is considered the member group. If you are testing a renamed matrix, you might want to add those names to your results instead of having the numerical order in which they were tested.

key <- data.frame(test = as.factor(1:5), 
                type = names(obj@tracks[["cg_type_tracks"]])[4:ncol(obj@tracks[["cg_type_tracks"]])]) # or just specify manually, but that can get tedious
collapsed_dmrs_cg <- left_join(collapsed_dmrs_cg, key, by = "test")
head(collapsed_dmrs_cg)
##     chr dmr_start dmr_end test direction dmr_length     dmr_padj dmr_logFC
## 1: chr1    997000 1000000    3      hypo       3000 1.228586e-08 -2.237933
## 2: chr1   1304000 1307000    4      hypo       3000 1.101757e-06 -1.670567
## 3: chr1   1305000 1308000    3     hyper       3000 4.898152e-08  1.869867
## 4: chr1   1305000 1309000    2      hypo       4000 7.382209e-04 -1.873633
## 5: chr1   1431000 1436000    4      hypo       5000 2.754107e-26 -2.174540
## 6: chr1   1630000 1632000    3      hypo       2000 6.139035e-13 -2.245600
##               gene_names  type
## 1: ENSG00000272512, HES4 Micro
## 2:                 ACAP3 Oligo
## 3:                 ACAP3 Micro
## 4:          ACAP3, PUSL1   Exc
## 5:       LINC01770, VWA1 Oligo
## 6:                  MIB2 Micro

If specific comparisons are desired, a data frame can be provided describing the tests. Three columns should be included: One listing members of group A, one listing members of group B, and one with the name of the test. As an example, we will test CH DMRs in excitatory vs. inhibitory neurons.

comparisons <- data.frame(
  stringsAsFactors = FALSE,
              name = c("exc_vs_inh"),
                 A = c("1"),
                 B = c("5")
)
dmrs_ch <- testDMR(sumMatrix = cluster1kbwindows_ch[["sum_matrix"]], comparisons = comparisons, nminTotal = 5, nminGroup = 5)
## 
## Finished testing exc_vs_inh: 1 vs. 5
dmrs_ch <- filterDMR(dmrs_ch, method = "bonferroni", filter = TRUE, pThreshold = 0.01, logThreshold = 2)
collapsed_dmrs_ch <- collapseDMR(obj, dmrs_ch, maxDist = 4000, minLength = 2000, reduce = T, annotate = T) 

EXPLORING DMR RESULTS

First, let’s look at how many DMRs were identified in each group.

ggplot(collapsed_dmrs_cg |> dplyr::group_by(type, direction) |> dplyr::summarise(n = n()), 
       aes(y = type, x = n, fill = type)) + geom_col() + 
  facet_grid(vars(direction), scales = "free_y") + scale_fill_manual(values = pal) + theme_classic()
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.

Select for top results based on combined lowest adjusted p and highest absolute logFC, like we did with findClusterMarkers.

top_dmrs_cg <- collapsed_dmrs_cg |> 
  dplyr::group_by(type, direction) |> 
  dplyr::arrange(dmr_padj, .by_group = TRUE) |> dplyr::mutate(rank_padj = 1:n()) |>
  dplyr::arrange(desc(abs(dmr_logFC)), .by_group = TRUE) |> dplyr::mutate(rank_logFC = 1:n()) |>
  rowwise() |> dplyr::mutate(total_rank = sum(rank_padj, rank_logFC)) |> 
  group_by(test, direction) |> slice_min(n = 1, order_by = total_rank) |>
  dplyr::mutate(location = paste0(chr, "_", (dmr_start - 2000), "_", (dmr_end + 2000))) |> dplyr::arrange(direction)

top_dmrs_ch <- collapsed_dmrs_ch |> 
  dplyr::group_by(direction) |> 
  dplyr::arrange(dmr_padj, .by_group = TRUE) |> dplyr::mutate(rank_padj = 1:n()) |>
  dplyr::arrange(desc(abs(dmr_logFC)), .by_group = TRUE) |> dplyr::mutate(rank_logFC = 1:n()) |>
  rowwise() |> dplyr::mutate(total_rank = sum(rank_padj, rank_logFC)) |> 
  group_by(test, direction) |> slice_min(n = 3, order_by = total_rank) |>
  dplyr::mutate(location = paste0(chr, "_", (dmr_start - 2000), "_", (dmr_end + 2000))) |> dplyr::arrange(direction)

Top DMRs for each group often fall over canonical marker genes, supporting the biological relevance of results. First let’s look at top CG DMRs. Let’s plot the regions as tracks underneath.

dmr_cg_hypo_tracks <- split(collapsed_dmrs_cg |> dplyr::filter(direction == "hypo") |> ungroup() |> dplyr::select(chr, dmr_start, dmr_end, type) |> dplyr::rename(start = dmr_start, end = dmr_end) |> data.table(), collapsed_dmrs_cg[collapsed_dmrs_cg$direction == "hypo"]$type)
                    
heatMap(obj, 
        track = "cg_type_tracks", 
        regions = top_dmrs_cg$location[top_dmrs_cg$direction == "hypo"], 
        nrow = 3, 
        arrowOverhang = 0,
        legend = F,
        extraTracks = dmr_cg_hypo_tracks,
        extraTrackColors = pal)

Now look at top CH DMRs. Plot the regions as tracks underneath.

dmr_ch_tracks <- split(collapsed_dmrs_ch |> dplyr::select(chr, dmr_start, dmr_end, test) |> dplyr::rename(start = dmr_start, end = dmr_end) |> data.table(), collapsed_dmrs_ch$test)

heatMap(obj, 
        track = "ch_type_tracks", 
        regions = top_dmrs_ch$location, 
        colorMax = 15,
        nrow = 3, 
        arrowOverhang = 0,
        extraTracks = dmr_ch_tracks)

Gene ontology (GO) analysis

Further interpretation of the results can be explored using a wide variety of packages available on R. In this example, we will use the topGO package to test for Gene Ontology (GO) term enrichments for genes with hypomethylated regions in the excitatory neuron group.

library(topGO)
library(org.Hs.eg.db)
background <- unique(obj@ref |> dplyr::filter(type == "gene" & gene_type == "protein_coding" & seqid != "chrM") |> dplyr::pull(gene_name))
query <- unlist(strsplit(collapsed_dmrs_cg$gene_names[collapsed_dmrs_cg$type == "Exc" & collapsed_dmrs_cg$direction == "hypo"], ", "))

GOdata <- new("topGOdata", 
              description = "GO Enrichment Analysis", 
              ontology = "BP", 
              allGenes = setNames(factor(as.integer(background %in% query), levels = c(0, 1)), background),
              geneSel = function(x) x == 1, 
              nodeSize = 10, 
              annot = annFUN.org, 
              mapping = "org.Hs.eg.db", 
              ID = "symbol")
resultElim <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
resultElim <- GenTable(GOdata, Fisher = resultElim, topNodes = 500, numChar = 60)
resultElim <- resultElim |> dplyr::filter(Fisher < 0.01 & Significant > 7) |> dplyr::mutate(fold_change = Significant/Expected, Fisher = as.numeric(Fisher)) |> dplyr::filter(fold_change > 2)
resultElim <- janitor::clean_names(resultElim)

As expected, top results are strongly related to excitatory synaptic transmission.

ggplot(resultElim, aes(x = fold_change, y = reorder(term, fold_change), fill = fisher)) + geom_col() + theme_classic() + scale_fill_viridis_c(direction = -1)

GO analysis can often produce hundreds of results. Selecting top candidates by either fold change or p value both have caveats. Cherry picking is misleading. To view results more holistically, packages like rrvgo can be helpful. We will first find parent terms and plot the proportion of results devoted to each with a treemapPlot. Notice how this helps illustrate that the vast majority of results are related to a couple terms with strong involvement in neuronal function.

suppressMessages(library(rrvgo))
simMatrix <- calculateSimMatrix(resultElim$go_id, orgdb="org.Hs.eg.db", ont="BP", method="Rel")
## preparing gene to GO mapping data...
## preparing IC data...
scores <- setNames(-log10(resultElim$fisher), resultElim$go_id)
reducedTerms <- reduceSimMatrix(simMatrix,scores, threshold=0.9, orgdb="org.Hs.eg.db")
treemapPlot(reducedTerms)

Finally, save your workspace so you don’t have to re-calculate everything in the future.

save.image("~/Downloads/brain_vignette_workspace.RData")

CONCLUSION

Thanks for trying out Amethyst! Additional utilities are still to come. We are also open to suggestions. If you use this package or code on our Github for your work, please cite our manuscript.

Good luck in your analysis!